Abstract

GC-content, the proportion of guanine and cytosine bases in a nucleotide sequence, and
palindromic sequences are characteristic of each organism as a result of genomic evolution.
The goal of our research was to establish a correlation between GC-content and palindromic
density in wild-type viral and randomly generated genomes. Forty viral genomes were downloaded
from GenBank, and their GC-ratios and palindromic densities were calculated and plotted using
Mathematica. The plot of palindromic density against GC-ratio for randomly generated sequences
(the palindromic density curve) exhibited a quadratic relationship and was superimposed over the
viral genome plot. The viral points followed the curvature of the random sequences' quadratic
curve, indicating a relationship between GC-content and palindrome density in viral genomes.
However, because viral genomes require certain non-palindromic sequences to function, the
palindromic densities of most wild-type genomes fell below the palindromic density curve. The
variance in palindrome densities of wild-type genomes with respect to the random sequences'
quadratic curve may be examined to determine evolutionary traits in genomes. A better
understanding of viral palindromic densities and GC-ratios would help in understanding conserved
secondary RNA structures in viral genomes and in future drug discovery. In addition, certain
viral genomes were identified as possible candidates for recombinant viruses, which are used in
gene therapy.



1 Introduction 

GC-content (guanine-cytosine content) refers to the percentage of a nucleotide sequence
that is made up of either guanine or cytosine bases. Sequences with a high GC-content tend
to be more stable than sequences with lower GC-content because of stacking interactions,
not because the GC pair has three hydrogen bonds and the AT pair has two (Yakovchuk et al.,
2006). Interestingly, higher GC-content is speculated to be associated with autolysis (cell
destruction by its own enzymes), which reduces cell longevity despite genetic
thermostability (Levin & Van Sickle, 1976).

GC-content varies throughout a genome, and this variation is speculated to be driven by
both selective and neutral processes (Pozzoli et al., 2008; Nishida, 2012). Genetic
recombination and genomic evolution are also directly affected by silent GC-content
(Birdsell, 2002). Hence, low GC-content and AT mutational bias are characteristic of
nonrecombining genomes. Because of codon usage bias in shorter sequences (stop codons have
an AT bias), shorter sequences tend to have lower GC-contents while longer sequences have
higher GC-contents (Wuitschick & Karrer, 1999).

Because GC-content varies among organisms, it has been used as a method of classifying
bacteria in higher-level hierarchical classification (Wayne et al., 1987).

DNA palindromes are inverted repeats that read identically from the 5' to 3' end on one
strand as from the 5' to 3' end on the complementary strand. These reverse complementary
sequences, which exemplify dyad symmetry, differ from lexical palindromes, which are
exactly identical forwards and backwards. The frequency of short palindromes has been
found to be useful in comparative genomics for distinguishing and typing species
(Lamprea-Burgunder et al., 2011).
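The distinction between a DNA palindrome and a lexical palindrome can be illustrated with a
short Java sketch. The class and method names below are hypothetical and are not taken from
the programs used in this study; the sketch simply assumes an uppercase sequence over the
letters A, C, G, and T.

```java
// Illustrative sketch only; the names here are assumptions, not the authors' code.
class DnaPalindromeCheck {

    // Watson-Crick complement of a single base (A<->T, G<->C).
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'G': return 'C';
            case 'C': return 'G';
            default: throw new IllegalArgumentException("Unexpected base: " + base);
        }
    }

    // A DNA palindrome equals its own reverse complement.
    static boolean isDnaPalindrome(String seq) {
        int n = seq.length();
        // An odd-length sequence cannot qualify: its middle base would
        // have to be its own complement.
        if (n == 0 || n % 2 != 0) {
            return false;
        }
        for (int i = 0; i < n / 2; i++) {
            if (seq.charAt(i) != complement(seq.charAt(n - 1 - i))) {
                return false;
            }
        }
        return true;
    }

    // A lexical palindrome reads the same forwards and backwards.
    static boolean isLexicalPalindrome(String seq) {
        return new StringBuilder(seq).reverse().toString().equals(seq);
    }

    public static void main(String[] args) {
        // GAATTC is a DNA palindrome but not a lexical one.
        System.out.println(isDnaPalindrome("GAATTC") + " " + isLexicalPalindrome("GAATTC"));
        // GATTAG is a lexical palindrome but not a DNA palindrome.
        System.out.println(isDnaPalindrome("GATTAG") + " " + isLexicalPalindrome("GATTAG"));
    }
}
```

Here GAATTC satisfies the reverse-complement test but not the lexical one, while GATTAG
does the opposite.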

Protein folding rates have been found to be affected by both the GC-content of
palindromes and palindromic density (Li & Li, 2010). However, it is not known whether
GC-content itself is correlated with the density or occurrence of DNA palindromes.

2 Materials and Methods 

Forty viral genomes of different lengths were chosen from information provided by the RNA
virus database (Belshaw et al., 2008), the International Committee on Taxonomy of Viruses
database (ICTVdB) (Fauquet & Fargette, 2005), and the ViralZone database (Hulo et al.,
2011). All viral genomes were downloaded from NCBI GenBank in FASTA file format (Benson et
al., 2009). Their GC-contents and palindromic densities were calculated using a Java
program (see Table 2).

A program was written in the Java programming language to randomly generate genetic
sequences based on a GC-ratio input. Because the java.util.Random class is not
cryptographically secure, the java.security.SecureRandom class was used to randomly
generate sequences. The GC-contents and palindromic densities of the randomly generated
sequences were used as controls against which the wild-type viral genomes were compared.
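A generator of this kind might look like the following minimal sketch. It is not the
program used in this study: the class name, the method signature, and the choice to draw
each base independently (so that the realized GC-content only approximates the requested
ratio) are all assumptions made for illustration.

```java
import java.security.SecureRandom;

// Illustrative sketch only; names and sampling scheme are assumptions.
class RandomSequenceGenerator {

    private static final SecureRandom RNG = new SecureRandom();

    // Builds a sequence of the given length in which each position is
    // G or C with probability gcRatio (0.0 to 1.0), and A or T otherwise.
    // Within each pair the two bases are equally likely.
    static String generate(int length, double gcRatio) {
        if (length < 0 || gcRatio < 0.0 || gcRatio > 1.0) {
            throw new IllegalArgumentException("length >= 0 and 0 <= gcRatio <= 1 required");
        }
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            boolean strongPair = RNG.nextDouble() < gcRatio; // true -> G or C
            boolean firstOfPair = RNG.nextBoolean();
            if (strongPair) {
                sb.append(firstOfPair ? 'G' : 'C');
            } else {
                sb.append(firstOfPair ? 'A' : 'T');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints a 60-base sequence with a target GC-ratio of 25%.
        System.out.println(generate(60, 0.25));
    }
}
```

Because each base is sampled independently, short sequences can deviate noticeably from
the target ratio; a generator that fixes the exact base composition and then shuffles
would be an alternative design.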

The GC-content of a DNA sequence is calculated as the sum of G and C bases divided by
the total number of nucleotides, expressed as a percentage:

$$\mathrm{GC\%} = \frac{G + C}{A + T + G + C} \times 100$$

The resultant number is known as the GC-ratio of the sequence. Another program was
written to calculate the GC-ratios of all randomly generated sequences and viral genomes.
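The GC-ratio formula translates directly into code. The sketch below is a hypothetical
implementation rather than the program used in this study; in particular, the decision to
skip any character other than A, T, G, or C (for example the ambiguity code N) is an
assumption.

```java
// Illustrative sketch only; names and handling of non-ACGT symbols are assumptions.
class GcContent {

    // Returns GC% = (G + C) / (A + T + G + C) * 100 for the given sequence.
    // Characters other than A, T, G, and C are ignored.
    static double gcPercent(String seq) {
        int gc = 0;
        int total = 0;
        for (char c : seq.toUpperCase().toCharArray()) {
            switch (c) {
                case 'G':
                case 'C':
                    gc++;
                    total++;
                    break;
                case 'A':
                case 'T':
                    total++;
                    break;
                default:
                    break; // skip anything that is not A, T, G, or C
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("No A, T, G, or C bases found");
        }
        return 100.0 * gc / total;
    }

    public static void main(String[] args) {
        // GAATTC contains one G and one C among six bases.
        System.out.println(gcPercent("GAATTC"));
    }
}
```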



A Java program was also written to count the occurrences of perfect (unbroken) DNA
palindromes in viral genomes and randomly generated sequences and to divide that number
by the length of the input sequence, giving a palindrome density. The palindrome density
value is defined by the equation:

$$\text{palindrome density} = \frac{p(S)}{l(S)}$$

for which S is any nucleotide sequence, p(S) represents the sum of palindromes found by
the function P(S) (the Java program written to search for all palindromic sequences),
l(P(S)) is the function that returns the length of a palindrome found by P(S) (to
determine whether the palindrome is to be summed), and l(S) represents the length of the
input sequence. Palindromes with a length of 4 are considered to be the shortest
palindromes because sequences shorter than 4 would be either single nucleotides or
insignificant.

It is important to note that this equation for "palindromic density" provides an
arbitrary constant that simply describes the ratio of the number of palindromes of
qualifying length to the length of a sequence. The P(S) algorithm is a computational
program written in the Java programming language to find all palindromes in a given
sequence. This includes finding palindromes within palindromes. Although this would seem
to overestimate the number of palindromic sequences, it gives a better account of
palindrome lengths, a factor not included in other palindrome-detecting algorithms, which
cannot differentiate a long palindrome from a shorter one.
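One way to count perfect palindromes, including those nested inside longer palindromes,
is to expand outward from every position between two adjacent bases. The sketch below is
an assumed reconstruction for illustration and is not the P(S) program itself; the
counting rules of that program are not fully specified here, so this sketch should not be
expected to reproduce the values reported in Tables 1 and 2.

```java
// Illustrative sketch only; not the authors' P(S) program.
class PalindromeDensity {

    static final int MIN_LENGTH = 4; // shortest palindrome that is counted

    // True if the two bases form a Watson-Crick pair.
    static boolean pairs(char a, char b) {
        return (a == 'A' && b == 'T') || (a == 'T' && b == 'A')
            || (a == 'G' && b == 'C') || (a == 'C' && b == 'G');
    }

    // Counts every reverse-complement palindromic substring of length >= MIN_LENGTH.
    // Each center is expanded step by step, so a long palindrome also
    // contributes the shorter palindromes nested around the same center.
    static int countPalindromes(String seq) {
        int count = 0;
        for (int center = 0; center + 1 < seq.length(); center++) {
            int left = center;
            int right = center + 1;
            while (left >= 0 && right < seq.length()
                    && pairs(seq.charAt(left), seq.charAt(right))) {
                if (right - left + 1 >= MIN_LENGTH) {
                    count++;
                }
                left--;
                right++;
            }
        }
        return count;
    }

    // Palindrome density: number of counted palindromes divided by sequence length.
    static double density(String seq) {
        if (seq.isEmpty()) {
            throw new IllegalArgumentException("Sequence must not be empty");
        }
        return (double) countPalindromes(seq) / seq.length();
    }

    public static void main(String[] args) {
        // GAATTC contains AATT (length 4) nested inside GAATTC (length 6).
        System.out.println(countPalindromes("GAATTC"));
        System.out.println(density("GAATTC"));
    }
}
```

For GAATTC the sketch counts two palindromes, the inner AATT and the full GAATTC, which
shows how a longer palindrome contributes more to the density than a shorter one.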

The palindromic densities per GC-ratio of randomly generated sequences were calculated
and plotted; there is a quadratic relationship between GC-content and palindromic density,
with the lowest palindromic density at 50% and the highest palindromic densities at the
extremes. Random sequences, by their nature, exhibit perfect palindromic density
relationships with GC-content, whereas wild-type sequences have specific GC-contents and
palindromic sequences (some of which are necessary for the genome to function).

To guard against chance fluctuations in the output of SecureRandom, 100 sequences were
randomly generated per input length, and their GC-contents and palindromic densities were
calculated. By the law of large numbers, the averages of GC-content and palindromic
density over these trials should be close to stable values.
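The effect of averaging over repeated trials can be sketched as follows. This is a
simplified, hypothetical illustration rather than the procedure used in this study: it
tracks only whether each randomly drawn base is G or C and averages the realized GC
percentage over the trials. The class and method names are assumptions.

```java
import java.security.SecureRandom;

// Illustrative sketch only; a simplified stand-in for the repeated-trial procedure.
class TrialAverage {

    // Generates `trials` random sequences of `length` bases, each base being
    // G or C with probability gcRatio, and returns the mean realized GC%.
    static double meanGcPercent(int trials, int length, double gcRatio, SecureRandom rng) {
        if (trials <= 0 || length <= 0) {
            throw new IllegalArgumentException("trials and length must be positive");
        }
        double sum = 0.0;
        for (int t = 0; t < trials; t++) {
            int gc = 0;
            for (int i = 0; i < length; i++) {
                if (rng.nextDouble() < gcRatio) {
                    gc++;
                }
            }
            sum += 100.0 * gc / length;
        }
        return sum / trials;
    }

    public static void main(String[] args) {
        SecureRandom rng = new SecureRandom();
        // A single trial may drift from the 25% target; the mean of 100 trials drifts less.
        System.out.println(meanGcPercent(1, 1000, 0.25, rng));
        System.out.println(meanGcPercent(100, 1000, 0.25, rng));
    }
}
```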

3 Results 

Randomly generated sequences with a GC-content of 50% had the lowest number of
palindromic sequences, whereas sequences with GC-contents of 25% and 75% had similarly
higher numbers of palindromes (see Table 1).

The Mathematica plots of the random sequences' palindromes demonstrated that the number
of palindromic sequences increased linearly as the length of the randomly generated
sequences increased (see Figure 1). The 25% and 75% plots exhibit nearly identical slopes,
while the 50% plot rises more gradually, with a smaller slope.
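The linear trend can be checked directly from the counts in Table 1 by dividing each
palindrome count by its sequence length. The following sketch does this arithmetic; the
class and array names are hypothetical, and the numbers are copied from Table 1.

```java
// Illustrative sketch only; the data are the counts reported in Table 1.
class TableOneDensities {

    static final int[] LENGTH = {500, 1000, 1500, 2000, 3000, 5000, 7500, 10000, 15000, 20000};
    static final int[] GC25 = {293, 590, 890, 1189, 1802, 2974, 4466, 5989, 8953, 11931};
    static final int[] GC50 = {208, 418, 624, 829, 1252, 2080, 3124, 4181, 6271, 8326};
    static final int[] GC75 = {298, 595, 896, 1194, 1782, 2986, 4482, 5950, 8973, 11914};

    // Palindromes per base for one table entry.
    static double ratio(int count, int length) {
        return (double) count / length;
    }

    public static void main(String[] args) {
        System.out.println("Length  25%     50%     75%");
        for (int i = 0; i < LENGTH.length; i++) {
            System.out.printf("%-7d %.4f  %.4f  %.4f%n",
                    LENGTH[i],
                    ratio(GC25[i], LENGTH[i]),
                    ratio(GC50[i], LENGTH[i]),
                    ratio(GC75[i], LENGTH[i]));
        }
    }
}
```

The per-base ratio stays roughly constant across lengths, at about 0.42 for the 50%
sequences and about 0.59 to 0.60 for the 25% and 75% sequences, which is what a linear
increase in palindrome count with sequence length implies.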

GC-contents and palindromic densities of randomly generated sequences exhibit a
quadratic relationship (see Figure 2). Using the Mathematica Fit function, this quadratic
relationship is defined by the equation:

$$f(x) = 1.403 - 4.137x + 4.132x^2$$

Although the quadratic representation has maxima at 0 and 1 (representing GC-ratios of 0%
and 100% respectively), wild-type genomes rarely, if ever, reach GC-ratios of less than
25% or greater than 75%. The minimum of the equation, at 0.5 (GC-content of 50%),
corresponds to the lowest palindromic density.
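The location of the minimum follows from the fitted coefficients: for a quadratic
a x^2 + b x + c with a > 0, the vertex lies at x = -b / (2a). The sketch below evaluates
the fitted curve and its vertex; the class and method names are hypothetical, and only
the three coefficients come from the fit reported above.

```java
// Illustrative sketch only; coefficients are those of the reported quadratic fit.
class DensityCurve {

    static final double C0 = 1.403;  // constant term
    static final double C1 = -4.137; // linear coefficient
    static final double C2 = 4.132;  // quadratic coefficient

    // Fitted palindromic density at GC-ratio x, with x between 0 and 1.
    static double f(double x) {
        return C0 + C1 * x + C2 * x * x;
    }

    // GC-ratio at which the fitted curve reaches its minimum: -b / (2a).
    static double vertex() {
        return -C1 / (2.0 * C2);
    }

    public static void main(String[] args) {
        System.out.printf("vertex at x = %.4f%n", vertex());
        System.out.printf("f(0) = %.4f, f(0.5) = %.4f, f(1) = %.4f%n", f(0.0), f(0.5), f(1.0));
    }
}
```

The vertex falls at approximately x = 0.5006, in agreement with the stated minimum at a
GC-content of 50%.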

Forty viral genomes were downloaded from GenBank in FASTA file format through the
previously stated databases, and their GC-contents and palindromic densities were
calculated. These values were stored as ordered pairs (GC-content, palindromic density)
in a comma-separated values (CSV) file, which was plotted in Mathematica. Palindromic
densities (ordinate) were plotted against GC-contents (abscissa) and superimposed over the
palindromic density by GC-content graph of randomly generated sequences (see Figure 2).

It was observed that the majority of the GC-ratios of viral genomes were in the 40%
range and that almost all viral palindromic densities were below the curve of the random
GC-content to palindrome density plot.
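The position of an individual genome relative to the fitted curve can be expressed as the
difference between its observed palindromic density and the value of the fitted quadratic
at its GC-ratio. The sketch below is a hypothetical illustration using a few rows from
Table 2 and the fitted coefficients; the class and method names are assumptions, and the
curve actually drawn in Figure 2 may differ from this fitted equation.

```java
// Illustrative sketch only; data rows are taken from Table 2.
class CurveDeviation {

    // Fitted quadratic for randomly generated sequences (x is the GC-ratio, 0 to 1).
    static double fitted(double x) {
        return 1.403 - 4.137 * x + 4.132 * x * x;
    }

    // Observed density minus fitted density. A negative value places the
    // genome below the fitted curve; a positive value places it above.
    static double deviation(double gcPercent, double observedDensity) {
        return observedDensity - fitted(gcPercent / 100.0);
    }

    public static void main(String[] args) {
        String[] names = {"Cassava Vein Mosaic Virus", "Hepatitis Delta Virus",
                          "Hepatitis E", "Rubella Virus"};
        double[] gc = {24.930, 58.799, 57.943, 69.596};
        double[] density = {0.504, 0.335, 0.458, 0.505};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-28s %+.4f%n", names[i], deviation(gc[i], density[i]));
        }
    }
}
```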

4 Discussion 

The lowest number of palindromic sequences is found in the randomly generated sequences
that have a GC-content of 50%. The number of palindromes increases as GC-content moves
away from 50% in either direction. This is most likely because a less even mix of bases
leaves fewer available combinations of bases, which allows for more palindromes.

As demonstrated by the viral genomes, there is a slight correlation between GC-content
and the presence of palindromic sequences. However, this relationship between GC-content
and palindromic density varies per viral genome, and there will always be outliers. This
is because certain palindromes (such as restriction enzyme sites or methylation sites) are
necessary for an organism to function. Although GC-content and palindromes are used
separately to differentiate species, GC-content seems to dictate the palindromic density
of most viral genomes, as evidenced by how the viral points in Figure 2 seem to follow the
random palindrome density curve, albeit below it.

Palindromic density is also directly affected by the presence of non-palindromic
sequences such as microsatellites or sequence motifs that, as mentioned above, code for
proteins necessary for the functioning of wild-type genomes. These DNA motifs and
naturally occurring sequences are not accounted for in randomly generated sequences. This
explains why almost all viral palindrome densities lie under the random palindrome density
curve.

The degree to which certain viral genomes diverge from the random sequences' quadratic
curve may provide details about a genome's evolution, usage of palindromes (such as for
methylation sites), and secondary structure stability. Cytosine methylation has been
recorded in mammalian DNA viruses, so viral genomes with a high GC-content and high
palindrome density would have a high presence of CpG islands (sites of methylation)
(Hoelzer et al., 2008).

However, it seems that there are certain sequences whose GC-content has no effect on
palindrome density. This occurrence is most likely due to an increased number of necessary
non-palindromic sequences. These genomes may still exhibit conserved palindromes, which
lead to conserved secondary RNA structures, open reading frames, coding or noncoding
regions in introns or exons (which may be highly conserved structures as well), or viral
microsatellites. Understanding microsatellites and their polymorphisms may be important in
differentiating strains of viruses, particularly herpes simplex strains (Deback et al.,
2009).

As observed, the majority of the sequences have a GC-ratio in the 40% to 50% range. Such
genomes are described as being slightly AT-rich rather than GC-poor (Musto et al., 1997).
The lower GC-content arises because some genomes are short (characteristic of many
viruses) and the stop codons of short genomes have an AT bias. In addition, low GC-content
is characteristic of nonrecombining genomes, meaning that viral genomes with higher
GC-content may be used as recombinant viruses for viral vaccines or gene therapy.

Palindromes are also necessary in the formation of secondary structures, many of which
are conserved throughout viral genomes (Fekete, 2000; Hofacker et al., 2004). An
understanding of conserved secondary structures in viral genomes would help in
establishing the evolution of these small genomes and could possibly lead to the discovery
of vaccines that target these conserved sequences.

For future research, the palindrome-finding algorithm can be expanded to search for
palindromic sequences with spacer sequences in addition to perfect palindromes. In
addition, further research may be done in comparing the randomly generated GC-contents
and palindrome densities (which are unchanging) to the genomes or genes of other species
or more complex organisms.

5 Conclusion 

GC-content and palindromic density share a non-linear, quadratic relationship, with the
lowest palindromic density at a GC-ratio of 50% and higher palindromic densities as
GC-ratios approach extreme values.

Most viral palindromic densities fell beneath the random palindrome density curve. This
is because viral genomes require non-palindromic sequences, such as open reading frames
(ORFs) or splicing sites, which are necessary for the natural functioning of a genome and
would not exist in randomly generated sequences.

In addition, the palindromic densities of viral genomes may be used to trace the
evolution of viruses, and an understanding of conserved structures such as palindromes may
help future drug discovery efforts target certain viral genomes.

Viral genomes with a high palindrome density relative to the random palindrome density
curve and a higher GC-ratio (>50%) may be used as recombinant viruses for future vaccines
and gene therapy.

6 Tables and Figures

Table 1: Randomly Generated Sequences with Corresponding Palindrome Count by GC-Content

Length    GC-content 25%    GC-content 50%    GC-content 75%
500       293               208               298
1000      590               418               595
1500      890               624               896
2000      1189              829               1194
3000      1802              1252              1782
5000      2974              2080              2986
7500      4466              3124              4482
10000     5989              4181              5950
15000     8953              6271              8973
20000     11931             8326              11914

Figure 1: A plot of the number of palindromes by length of randomly generated sequences
with GC-contents of 25%, 50%, and 75%.

Figure 2: A plot demonstrating the relation of Palindrome Density and GC-ratio of viral
genomes compared to that of randomly generated sequences.

Table 2: Viral Genomes with Corresponding Palindrome Count and GC-Content

Species                              RefSeq         Length    GC-content    Palindrome Density
Eyach Virus VP12 Protein             NC_003707.1    678       47.198%       0.411
Rotavirus A VP7                      NC_011503.2    1062      35.782%       0.495
Hepatitis Delta Virus                NC_001653.2    1682      58.799%       0.335
Hantaan Virus                        NC_005218.2    1696      42.935%       0.384
H1N1 Hemagglutinin                   NC_002017.1    1778      41.620%       0.402
Rotavirus B VP2                      NC_007549.1    2969      34.321%       0.480
Hepatitis B                          NC_003977.1    3215      49.269%       0.365
Adeno-associated Virus - 2           NC_001401.2    4679      53.794%       0.428
Enterobacteria Phage ΦX174           NC_001422.1    5386      44.764%       0.396
Maize Rayado Fine Virus              NC_002786.1    6305      61.982%       0.391
Turnip Yellow Mosaic Virus           NC_004063.1    6318      56.442%       0.352
Human Astrovirus                     NC_001943.1    6813      44.841%       0.386
Human Rhinovirus A 89                NC_001617.1    7152      39.038%       0.412
Hepatitis E                          L08816.1       7176      57.943%       0.458
Grapevine Fleck Virus                NC_003347.1    7564      66.235%       0.316
Foot-and-mouth Disease Virus Type    NC_004004.1    8134      55.274%       0.392
Cassava Vein Mosaic Virus            NC_001648.1    8159      24.930%       0.504
Aichi Virus                          NC_001918.1    8251      58.902%       0.358
Human T-lymphotropic Virus 1         NC_001436.1    8507      53.462%       0.404
Human Immunodeficiency Virus 1       NC_001802.1    9181      42.120%       0.401
GB Virus C                           NC_001710.1    9393      59.082%       0.420
Hepatitis C                          NC_004102.1    9464      58.231%       0.414
Rubella Virus                        NC_001545.2    9762      69.596%       0.505
Human Immunodeficiency Virus 2       NC_001722.1    10351     45.661%       0.382
Louping Ill Virus                    NC_003690.1    10871     54.852%       0.378
Langat Virus                         NC_003690.1    10943     54.309%       0.377
Rabies Virus                         NC_001542.1    11932     45.097%       0.386
Japanese Encephalitis Virus          NC_001437.1    10976     51.421%       0.364
Mayaro Virus                         NC_003417.1    11411     50.372%       0.411
Simian Foamy Virus                   NC_001364.1    13246     38.955%       0.430
Human Respiratory Syncytial Virus    NC_001781.1    15225     47.427%       0.402
Human Parainfluenza Virus 1          NC_003461.1    15600     33.557%       0.449
Measles Virus                        NC_001498.1    15895     47.427%       0.402
Nipah Virus                          NC_002728.1    18246     38.167%       0.446
Ebola Virus                          NC_002549.3    18959     41.073%       0.441
Marburg Virus                        NC_001608.3    19111     38.292%       0.440
Acidianus Bottle-shaped Virus        NC_009452.1    23814     34.564%       0.429
Gill-associated Virus                NC_010306.1    26253     46.235%       0.385
Human Coronavirus NL63               NC_005831.2    27553     34.461%       0.429
SARS Coronavirus                     NC_004718.3    29751     40.672%       0.414